Describe dataset

Exploratory Data Analysis (notes)

1) Inspection of single variables

- Summary statistics: 5-number summary, mean, variance, stem-and-leaf display, box-plot, histograms, density plots


2) Bi- and multivariate inspection

- Contingency tables, mosaic plot, scatter-plot, group stratified plots


3) Quantification of the above

- Association: suitable measures e.g. covariance, Pearson or rank correlation.
- Difference: suitable hypothesis tests, e.g. t-test, Wilcoxon signed rank test
In [667]:
# Libraries
import pandas as pd
import numpy  as np
from matplotlib import pyplot as plt
import datetime as dt
import seaborn as sns
from pandas import Series
from scipy.stats import ttest_ind
import statsmodels.api as sm
from sklearn.preprocessing import Imputer

1) Reviewing data, initial analysis

In [445]:
plt.figure(figsize=(50, 50), dpi= 100, facecolor='w', edgecolor='k')
Out[445]:
<matplotlib.figure.Figure at 0x1a201bf320>
In [446]:
feature_df = pd.read_csv("features.csv")
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
store_df = pd.read_csv("stores.csv")
In [447]:
feature_df.head()
Out[447]:
Store Date Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment IsHoliday
0 1 05/02/2010 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False
1 1 12/02/2010 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True
2 1 19/02/2010 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False
3 1 26/02/2010 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False
4 1 05/03/2010 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False
In [448]:
feature_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8190 entries, 0 to 8189
Data columns (total 12 columns):
Store           8190 non-null int64
Date            8190 non-null object
Temperature     8190 non-null float64
Fuel_Price      8190 non-null float64
Promotion1      4032 non-null float64
Promotion2      2921 non-null float64
Promotion3      3613 non-null float64
Promotion4      3464 non-null float64
Promotion5      4050 non-null float64
CPI             7605 non-null float64
Unemployment    7605 non-null float64
IsHoliday       8190 non-null bool
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 711.9+ KB
In [449]:
train_df.head()
Out[449]:
Store Dept Date Weekly_Sales IsHoliday
0 1 1 05/02/2010 24924.50 False
1 1 1 12/02/2010 46039.49 True
2 1 1 19/02/2010 41595.55 False
3 1 1 26/02/2010 19403.54 False
4 1 1 05/03/2010 21827.90 False
In [450]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null object
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB
In [451]:
test_df.head()
Out[451]:
Store Dept Date IsHoliday Weekly_Sales to be predicted
0 1 1 02/11/2012 False NaN
1 1 1 09/11/2012 False NaN
2 1 1 16/11/2012 False NaN
3 1 1 23/11/2012 True NaN
4 1 1 30/11/2012 False NaN
In [452]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115064 entries, 0 to 115063
Data columns (total 5 columns):
Store                           115064 non-null int64
Dept                            115064 non-null int64
Date                            115064 non-null object
IsHoliday                       115064 non-null bool
Weekly_Sales to be predicted    0 non-null float64
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 3.6+ MB
In [453]:
store_df.head()
Out[453]:
Store Type Size (sq ft)
0 1 A 151315
1 2 A 202307
2 3 B 37392
3 4 A 205863
4 5 B 34875
In [454]:
store_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 3 columns):
Store           45 non-null int64
Type            45 non-null object
Size (sq ft)    45 non-null int64
dtypes: int64(2), object(1)
memory usage: 1.1+ KB

We first need to review the datatype of each variable (column) stored in the tables.

Feature_df (8190 rows)

  • Date: needs to be converted to a datetime dtype; as a plain string it cannot be placed on a time axis, and plotting it directly would raise a type error.
  • IsHoliday: a categorical (boolean) dtype; we can use it to stratify the data and compare holidays against non-holidays.

Train_df (421570 rows)

  • Date: needs to be converted to a datetime dtype, for the reason mentioned above.

Test_df (115064 rows)

  • Date: needs to be converted to a datetime dtype, for the reason mentioned above.
In [455]:
# Check if Null values exist
def null_checker(dataframe_column):
    return dataframe_column.isnull().sum().sum()
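For a per-column view in one call, pandas' vectorised null counting gives the same numbers as mapping null_checker over every column; a sketch on a toy frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for feature_df, with two gaps in Promotion1.
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Promotion1": [np.nan, 100.0, np.nan, 250.0],
})

# isnull().sum() tabulates missing values per column in a single call.
null_counts = toy.isnull().sum()
print(null_counts["Promotion1"])  # 2
```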

Using the null_checker function we wrote above, we can now tabulate the variables with missing values for all tables to get a better picture of missing data and its potential impact:

Variable                         Stores  Features    Test   Train
Store                                 0         0       0       0
Type                                  0         -       -       -
Size                                  0         -       -       -
Date                                  -         0       0       0
Temperature                           -         0       -       -
Fuel_Price                            -         0       -       -
Promotion1                            -      4158       -       -
Promotion2                            -      5269       -       -
Promotion3                            -      4577       -       -
Promotion4                            -      4726       -       -
Promotion5                            -      4140       -       -
CPI                                   -       585       -       -
Unemployment                          -       585       -       -
IsHoliday                             -         0       0       0
Department                            -         -       0       0
Weekly_Sales                          -         -       -       0
Weekly_Sales (to be predicted)        -         -  115064       -

2) Some initial exploratory analysis of the data before we merge the tables.

In [456]:
# Let's output the summary statistics of the tables we want to focus on
feature_df.describe()
Out[456]:
Store Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment
count 8190.000000 8190.000000 8190.000000 4032.000000 2921.000000 3613.000000 3464.000000 4050.000000 7605.000000 7605.000000
mean 23.000000 59.356198 3.405992 7032.371786 3384.176594 1760.100180 3292.935886 4132.216422 172.460809 7.826821
std 12.987966 18.678607 0.431337 9262.747448 8793.583016 11276.462208 6792.329861 13086.690278 39.738346 1.877259
min 1.000000 -7.290000 2.472000 -2781.450000 -265.760000 -179.260000 0.220000 -185.170000 126.064000 3.684000
25% 12.000000 45.902500 3.041000 1577.532500 68.880000 6.600000 304.687500 1440.827500 132.364839 6.634000
50% 23.000000 60.710000 3.513000 4743.580000 364.570000 36.260000 1176.425000 2727.135000 182.764003 7.806000
75% 34.000000 73.880000 3.743000 8923.310000 2153.350000 163.150000 3310.007500 4832.555000 213.932412 8.567000
max 45.000000 101.950000 4.468000 103184.980000 104519.540000 149483.310000 67474.850000 771448.100000 228.976456 14.313000
In [457]:
train_df.describe()
Out[457]:
Store Dept Weekly_Sales
count 421570.000000 421570.000000 421570.000000
mean 22.200546 44.260317 15981.258123
std 12.785297 30.492054 22711.183519
min 1.000000 1.000000 -4988.940000
25% 11.000000 18.000000 2079.650000
50% 22.000000 37.000000 7612.030000
75% 33.000000 74.000000 20205.852500
max 45.000000 99.000000 693099.360000

Both the feature and train tables are analysed further using the 5-number summary and the proportion of NaN data. Fuel_Price and Unemployment cluster close to their means (low standard deviation), whereas Promotions 1 to 5 have very high standard deviations, i.e. their values are spread over a much wider range: a handful of very large promotion amounts dominate heavily skewed distributions, and each of these columns also carries a large share of missing data.

The Store column spans the same range (1 to 45) in both train_df and feature_df, and since Store is a shared key the two tables can be merged on it. With over 421,000 weekly sales records, we have a substantial base for training a model to predict future sales.

I did not output the 5-number summary for the test or stores tables because they contain hardly any continuous variables; their data is mainly categorical.

In [458]:
# Replace NaNs with 0 so later plots and covariance calculations run.
# Note: zero-filling biases the means and covariances of the promotion and
# CPI columns; missing data is revisited properly with imputation later on.
feature_df = feature_df.fillna(0)

Standard deviation

A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
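Because the columns here live on very different scales, raw standard deviations are hard to compare directly; the coefficient of variation (standard deviation divided by the mean) puts spread on a common, unitless footing. A sketch with made-up numbers:

```python
import numpy as np

fuel_price = np.array([2.5, 2.6, 2.7, 2.8])        # tight around its mean
promotion = np.array([10.0, 500.0, 20.0, 9000.0])  # heavily skewed

cv_fuel = fuel_price.std() / fuel_price.mean()
cv_promo = promotion.std() / promotion.mean()

# The promotion series is far more dispersed relative to its own mean.
print(cv_fuel < cv_promo)  # True
```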

In [459]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))  # 1 row, 2 columns
train_df.plot.scatter(x = "Store", y = "Weekly_Sales", color = 'DarkBlue', ax=ax1)
train_df.plot.scatter(x = "Dept", y = "Weekly_Sales", color = 'DarkBlue', ax=ax2)
plt.tight_layout()
In [657]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Store', y ='Weekly_Sales', data = train_df, ax=ax1)
sns.lineplot(x = 'Dept', y ='Weekly_Sales', data = train_df, ax=ax2)
plt.tight_layout() 
plt.show()
In [461]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))
feature_df.plot.scatter(x = "Store", y = "Promotion1", color = 'DarkGreen', ax=ax1)
feature_df.plot.scatter(x = "Store", y = "Promotion2", color = 'DarkGreen', ax=ax2)
plt.tight_layout() 
In [462]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))
feature_df.plot.scatter(x = "CPI", y = "Promotion1", color = 'DarkGreen', ax=ax1)
feature_df.plot.scatter(x = "CPI", y = "Promotion5", color = 'DarkGreen',ax=ax2)
plt.tight_layout() 
In [463]:
# We will visualise continuity with line plot graphs
feature_df['Date'] =  pd.to_datetime(feature_df['Date'],
                              format='%d/%m/%Y')
In [656]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Promotion1', data = feature_df, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion2', data = feature_df, ax=ax2)
plt.tight_layout() 
plt.show()

We have no Promotion1 or Promotion2 data for 2010 and 2011, then a spike in promotions during 2012 and again later in 2013.

In [655]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Promotion3', data = feature_df, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion4', data = feature_df, ax=ax2)
plt.tight_layout() 
plt.show()
In [654]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='CPI', data = feature_df, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion5', data = feature_df, ax=ax2)
plt.tight_layout() 
plt.show()
In [467]:
# Covariance calculation
columns = ['Store', 'Temperature', 'Fuel_Price', 'Promotion1', 'Promotion2', 'Promotion3', 'Promotion4', 'Promotion5', 'CPI', 'Unemployment']
feature_df[columns].cov()
Out[467]:
Store Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment
Store 168.687263 -4.989264 0.373565 -8.337152e+03 -3.651613e+03 -2.736468e+03 -3.427193e+03 -1.006021e+03 -100.299091 5.006917
Temperature -4.989264 348.890354 0.816587 -1.583846e+04 -2.259900e+04 -8.842017e+03 -7.445559e+03 -3.470960e+03 -56.073624 -2.775438
Fuel_Price 0.373565 0.816587 0.186051 8.381261e+02 7.129273e+01 9.219665e+01 3.022976e+02 4.076271e+02 -5.662715 -0.144643
Promotion1 -8337.152336 -15838.463844 838.126086 5.459608e+07 6.382950e+06 -1.028792e+06 2.874332e+07 1.242641e+07 -17583.402352 -2625.933908
Promotion2 -3651.613028 -22599.003490 71.292731 6.382950e+06 3.020114e+07 -6.799934e+05 1.774877e+06 2.590959e+06 6947.946632 -318.754005
Promotion3 -2736.468246 -8842.017389 92.196654 -1.028792e+06 -6.799934e+05 5.685090e+07 -5.026811e+05 7.170161e+05 6119.991482 -135.797915
Promotion4 -3427.192765 -7445.559475 302.297581 2.874332e+07 1.774877e+06 -5.026811e+05 2.215690e+07 5.309252e+06 -14861.276798 -1258.980667
Promotion5 -1006.020707 -3470.959744 407.627060 1.242641e+07 2.590959e+06 7.170161e+05 5.309252e+06 8.894797e+07 -10020.404904 -1495.195493
CPI -100.299091 -56.073624 -5.662715 -1.758340e+04 6.947947e+03 6.119991e+03 -1.486128e+04 -1.002040e+04 3439.299976 68.543342
Unemployment 5.006917 -2.775438 -0.144643 -2.625934e+03 -3.187540e+02 -1.357979e+02 -1.258981e+03 -1.495195e+03 68.543342 7.335949

When we calculate the covariance of all the numeric columns in feature_df we can see that:

  • Temperature and Fuel_Price covary positively.
  • Fuel_Price covaries positively with Store, Temperature and Promotions 1 to 5.
  • Promotion1 covaries positively with Fuel_Price and Promotions 2, 4 and 5.
  • Promotion2 covaries positively with Fuel_Price, Promotions 1, 4 and 5, and CPI.
  • Promotion3 covaries positively with Fuel_Price, Promotion5 and CPI.
  • Promotion4 covaries positively with Fuel_Price and Promotions 1, 2 and 5.
  • Promotion5 covaries positively with Fuel_Price and Promotions 1 to 4.
  • CPI covaries positively with Promotion2, Promotion3 and Unemployment.
  • Unemployment covaries positively with Store and CPI.

In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive.

Given the covariance relationships above, it would be interesting to further investigate the relationship between CPI and Unemployment.
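One caveat when reading the matrix above: covariance magnitudes are scale-dependent, so the promotion entries look huge simply because promotion values are large. Dividing by the standard deviations gives the unit-free correlation, which is what `.corr()` computes. A minimal synthetic sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(0, 1, 100)})
df["b"] = 1000 * df["a"] + rng.normal(0, 1, 100)  # same relation, larger units

cov_ab = df.cov().loc["a", "b"]    # large, driven purely by b's scale
corr_ab = df.corr().loc["a", "b"]  # near 1 regardless of scale
print(cov_ab > 100, round(corr_ab, 3))
```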

In [468]:
feature_df.plot.scatter(x = "CPI", y = "Unemployment", color = 'DarkGreen')
Out[468]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a35834588>

The graphs above clearly show that the major problem with this dataset is missing values. Left unaddressed, this problem propagates into later stages and undermines our future predictions and model building: the more complete the data, the better trained our models can be, and the more accurate the resulting analysis. It is therefore worth trying to fill in some of the gaps using imputation, including deep-learning-based imputation, and then comparing the analysis above against the filled dataset.

Running the null_checker function shows that the number of empty rows varies across Promotions 1 to 5; the promotion with the most non-null values is Promotion5.

In [469]:
feature_df.Promotion1
Out[469]:
0           0.00
1           0.00
2           0.00
3           0.00
4           0.00
5           0.00
6           0.00
7           0.00
8           0.00
9           0.00
10          0.00
11          0.00
12          0.00
13          0.00
14          0.00
15          0.00
16          0.00
17          0.00
18          0.00
19          0.00
20          0.00
21          0.00
22          0.00
23          0.00
24          0.00
25          0.00
26          0.00
27          0.00
28          0.00
29          0.00
          ...   
8160     1341.33
8161     3877.36
8162    14746.10
8163     3130.28
8164    14508.96
8165    53311.88
8166     9362.02
8167    10781.51
8168     6614.32
8169    16382.54
8170     9867.03
8171    11923.74
8172     5444.00
8173    16427.83
8174     8760.15
8175     1399.81
8176     1260.65
8177     8345.40
8178     4689.18
8179     4515.35
8180     3249.34
8181     6474.49
8182     9977.82
8183     2471.44
8184     4989.34
8185     4842.29
8186     9090.48
8187     3789.94
8188     2961.49
8189      212.02
Name: Promotion1, Length: 8190, dtype: float64

Regarding the quality of the datasets: the promotion values are floating point, but what do they represent specifically? A percentage discount? An example value for Promotion1 is 53311.88, which suggests aggregate price reductions, though the values are still not very informative on their own. We do have useful categorical data, e.g. IsHoliday, which lets us split the promotions into holiday and non-holiday groups. The "Weekly_Sales to be predicted" column is entirely null, but its row count tells us how many predictions we need to produce: 115,064. We will use Weekly_Sales together with the Promotions, IsHoliday and Date variables to forecast sales at the department level. Data quality is acceptable overall, since the key variables required for the sales forecast (Department, IsHoliday, Weekly_Sales, Date, and so on) are provided. The main problem is the incomplete promotion data: promotions are a key variable, and without them we cannot make accurate weekly-sales predictions. We do, however, have partial promotion data, so we can attempt to reconstruct the missing values with methods such as imputation or time-series analysis.

Let us now merge the features, train and store tables

First we will convert the Date columns to DateTime dtypes

In [470]:
train_df['Date'] = pd.to_datetime(train_df['Date'], format='%d/%m/%Y')
In [471]:
test_df['Date'] = pd.to_datetime(test_df['Date'], format = '%d/%m/%Y')
In [472]:
# Merging tables store_df
store_feature_merged_df = pd.merge(feature_df, store_df, on='Store')
In [473]:
store_feature_merged_df.head()
Out[473]:
Store Date Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment IsHoliday Type Size (sq ft)
0 1 2010-02-05 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 False A 151315
1 1 2010-02-12 38.51 2.548 0.0 0.0 0.0 0.0 0.0 211.242170 8.106 True A 151315
2 1 2010-02-19 39.93 2.514 0.0 0.0 0.0 0.0 0.0 211.289143 8.106 False A 151315
3 1 2010-02-26 46.63 2.561 0.0 0.0 0.0 0.0 0.0 211.319643 8.106 False A 151315
4 1 2010-03-05 46.50 2.625 0.0 0.0 0.0 0.0 0.0 211.350143 8.106 False A 151315
In [474]:
store_feature_merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8190 entries, 0 to 8189
Data columns (total 14 columns):
Store           8190 non-null int64
Date            8190 non-null datetime64[ns]
Temperature     8190 non-null float64
Fuel_Price      8190 non-null float64
Promotion1      8190 non-null float64
Promotion2      8190 non-null float64
Promotion3      8190 non-null float64
Promotion4      8190 non-null float64
Promotion5      8190 non-null float64
CPI             8190 non-null float64
Unemployment    8190 non-null float64
IsHoliday       8190 non-null bool
Type            8190 non-null object
Size (sq ft)    8190 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(9), int64(2), object(1)
memory usage: 903.8+ KB
In [475]:
# Merging tables store_feature_merged_df with train_df
final_df = pd.merge(train_df, store_feature_merged_df)
In [476]:
final_df.head()
Out[476]:
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment Type Size (sq ft)
0 1 1 2010-02-05 24924.50 False 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 A 151315
3 1 4 2010-02-05 39954.04 False 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 A 151315
4 1 5 2010-02-05 32229.38 False 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 A 151315
In [477]:
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 16 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null datetime64[ns]
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
Temperature     421570 non-null float64
Fuel_Price      421570 non-null float64
Promotion1      421570 non-null float64
Promotion2      421570 non-null float64
Promotion3      421570 non-null float64
Promotion4      421570 non-null float64
Promotion5      421570 non-null float64
CPI             421570 non-null float64
Unemployment    421570 non-null float64
Type            421570 non-null object
Size (sq ft)    421570 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(10), int64(3), object(1)
memory usage: 51.9+ MB
In [478]:
final_df['Date'].min()
Out[478]:
Timestamp('2010-02-05 00:00:00')
In [479]:
final_df['Date'].max()
Out[479]:
Timestamp('2012-10-26 00:00:00')

Analysis of Weekly_Sales

In [480]:
columns = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'Promotion1', 'Promotion2', 'Promotion3', 'Promotion4', 'Promotion5', 'CPI', 'Unemployment']
final_df[columns].cov()
Out[480]:
Weekly_Sales Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment
Weekly_Sales 5.157979e+08 -968.855983 -1.252687 6.484161e+06 2.392253e+06 4.842130e+06 3.313949e+06 4.822476e+06 -18606.462689 -1094.491917
Temperature -9.688560e+02 340.326164 1.216851 -2.949355e+03 -1.685314e+04 -5.714476e+03 -3.612483e+03 -1.145115e+03 131.558813 3.324999
Fuel_Price -1.252687e+00 1.216851 0.210236 8.243629e+02 6.796596e+01 4.719006e+01 2.975366e+02 4.156004e+02 -2.948414 -0.028922
Promotion1 6.484161e+06 -2949.355260 824.362893 3.663138e+07 5.381307e+06 -4.822315e+05 1.977398e+07 1.056975e+07 2586.874690 -1186.023528
Promotion2 2.392253e+06 -16853.135807 67.965960 5.381307e+06 2.585253e+07 -1.709261e+05 2.242564e+06 2.818321e+06 -707.605631 -392.482102
Promotion3 4.842130e+06 -5714.476486 47.190058 -4.822315e+05 -1.709261e+05 3.056844e+07 -2.588210e+05 9.880257e+05 -1264.146323 -186.234523
Promotion4 3.313949e+06 -3612.483248 297.536579 1.977398e+07 2.242564e+06 -2.588210e+05 1.516736e+07 4.971238e+06 -312.203787 -555.227805
Promotion5 4.822476e+06 -1145.115137 415.600444 1.056975e+07 2.818321e+06 9.880257e+05 4.971238e+06 1.770414e+07 11188.700707 -943.992332
CPI -1.860646e+04 131.558813 -2.948414 2.586875e+03 -7.076056e+02 -1.264146e+03 -3.122038e+02 1.118870e+04 1533.448867 -21.886169
Unemployment -1.094492e+03 3.324999 -0.028922 -1.186024e+03 -3.924821e+02 -1.862345e+02 -5.552278e+02 -9.439923e+02 -21.886169 3.471872
In [481]:
# Pairwise correlation: seeing which pair is more relevant or which are the same etc.
final_df[columns].corr()
Out[481]:
Weekly_Sales Temperature Fuel_Price Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 CPI Unemployment
Weekly_Sales 1.000000 -0.002312 -0.000120 0.047172 0.020716 0.038562 0.037467 0.050465 -0.020921 -0.025864
Temperature -0.002312 1.000000 0.143859 -0.026415 -0.179672 -0.056026 -0.050281 -0.014752 0.182112 0.096730
Fuel_Price -0.000120 0.143859 1.000000 0.297056 0.029153 0.018615 0.166622 0.215420 -0.164210 -0.033853
Promotion1 0.047172 -0.026415 0.297056 1.000000 0.174868 -0.014411 0.838904 0.415050 0.010915 -0.105168
Promotion2 0.020716 -0.179672 0.029153 0.174868 1.000000 -0.006080 0.113250 0.131735 -0.003554 -0.041427
Promotion3 0.038562 -0.056026 0.018615 -0.014411 -0.006080 1.000000 -0.012020 0.042471 -0.005839 -0.018078
Promotion4 0.037467 -0.050281 0.166622 0.838904 0.113250 -0.012020 1.000000 0.303370 -0.002047 -0.076513
Promotion5 0.050465 -0.014752 0.215420 0.415050 0.131735 0.042471 0.303370 1.000000 0.067906 -0.120406
CPI -0.020921 0.182112 -0.164210 0.010915 -0.003554 -0.005839 -0.002047 0.067906 1.000000 -0.299953
Unemployment -0.025864 0.096730 -0.033853 -0.105168 -0.041427 -0.018078 -0.076513 -0.120406 -0.299953 1.000000

From the table above, the promotions relate most strongly to Weekly_Sales, although even those correlations are weak in absolute terms (roughly 0.02 to 0.05). The direction still makes intuitive sense for customer behaviour: we are more inclined to purchase when promotions are running than when they are not.
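A handy way to rank candidate drivers is to pull the Weekly_Sales row out of the correlation matrix and sort it by absolute value; a sketch on a toy frame (columns and coefficients invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "Promotion1": rng.normal(0, 1, n),
    "Unemployment": rng.normal(0, 1, n),
})
# Sales depend on Promotion1 only, plus noise.
toy["Weekly_Sales"] = 0.5 * toy["Promotion1"] + rng.normal(0, 1, n)

ranked = (toy.corr()["Weekly_Sales"]
             .drop("Weekly_Sales")
             .abs()
             .sort_values(ascending=False))
print(ranked.index[0])  # Promotion1 comes out on top
```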

In [482]:
# Difference test: t-test for holidays and non holidays sales:
holiday_true = final_df[final_df['IsHoliday'] == True]
holiday_false = final_df[final_df['IsHoliday'] == False]

ttest_ind(holiday_true['Weekly_Sales'], holiday_false['Weekly_Sales'])
Out[482]:
Ttest_indResult(statistic=8.2947568539318901, pvalue=1.091222267743316e-16)

The t-value measures the size of the difference relative to the variation in your sample data.

  • T is simply the calculated difference represented in units of standard error. The greater the magnitude of T, the greater the evidence against the null hypothesis. This means there is greater evidence that there is a significant difference. The closer T is to 0, the more likely there isn't a significant difference.
  • The p-value is a number between 0 and 1 and interpreted in the following way: A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
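One caveat worth noting: by default `ttest_ind` assumes equal variances in the two groups, and holiday weeks plausibly vary more than ordinary weeks. Passing `equal_var=False` runs Welch's t-test, which drops that assumption; a sketch on synthetic sales figures (numbers invented, loosely echoing the scale above):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
holiday = rng.normal(20000, 30000, 2000)       # higher mean, wider spread
non_holiday = rng.normal(15800, 22000, 20000)  # baseline weeks

# Welch's t-test: no equal-variance assumption.
stat, p = ttest_ind(holiday, non_holiday, equal_var=False)
print(stat > 0, p < 0.05)
```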

Run regression model before imputation

In [483]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In [484]:
regression_before_imputation = pd.concat([final_df, test_df])
In [485]:
regression_before_imputation = regression_before_imputation.fillna(0)
In [486]:
# Use min max scaler
scaler = MinMaxScaler()
DF_Scaled_1 = scaler.fit_transform(regression_before_imputation[columns])
DF_Scaled_1 = pd.DataFrame(data=DF_Scaled_1, columns=regression_before_imputation[columns].columns)
In [487]:
# Make X and y
y = DF_Scaled_1['Weekly_Sales']
DF_Scaled_1 = DF_Scaled_1.drop('Weekly_Sales', axis = 1)
X = DF_Scaled_1
In [488]:
# Add an intercept column (X aliases DF_Scaled_1, so X picks this up too)
DF_Scaled_1['_intercept'] = 1
In [489]:
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares regression (not a logit model)
ols_model = sm.OLS(np.array(y_train), np.array(X_train))

# Fit the model
result = ols_model.fit()
In [490]:
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.094
Model:                            OLS   Adj. R-squared:                  0.094
Method:                 Least Squares   F-statistic:                     4646.
Date:                Wed, 15 May 2019   Prob (F-statistic):               0.00
Time:                        11:54:10   Log-Likelihood:             8.5648e+05
No. Observations:              402475   AIC:                        -1.713e+06
Df Residuals:                  402465   BIC:                        -1.713e+06
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0030      0.000      9.709      0.000       0.002       0.004
x2             0.0156      0.000     41.985      0.000       0.015       0.016
x3             0.0056      0.001      3.715      0.000       0.003       0.009
x4             0.0113      0.001     10.244      0.000       0.009       0.013
x5             0.0315      0.001     24.146      0.000       0.029       0.034
x6             0.0073      0.002      4.388      0.000       0.004       0.011
x7             0.0279      0.001     18.962      0.000       0.025       0.031
x8             0.0048      0.000     18.638      0.000       0.004       0.005
x9             0.0065      0.000     18.285      0.000       0.006       0.007
const          0.0080   9.68e-05     82.930      0.000       0.008       0.008
==============================================================================
Omnibus:                   327809.602   Durbin-Watson:                   2.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         13650592.088
Skew:                           3.657   Prob(JB):                         0.00
Kurtosis:                      30.577   Cond. No.                         72.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [491]:
# Here we will see if any variable affects weekly_sales.
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
final_df.plot.scatter(x='CPI', y='Weekly_Sales', ax=ax1, alpha = 0.2)
final_df.plot.scatter(x='Fuel_Price', y='Weekly_Sales', ax=ax2, alpha = 0.2)
plt.tight_layout() 
In [492]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
final_df.plot.scatter(x='Promotion1', y='Weekly_Sales', ax=ax1, alpha = 0.2)
final_df.plot.scatter(x='Promotion2', y='Weekly_Sales', ax=ax2, alpha = 0.2)
plt.tight_layout() 
In [493]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
final_df.plot.scatter(x='Promotion3', y='Weekly_Sales', ax=ax1, alpha = 0.2)
final_df.plot.scatter(x='Promotion4', y='Weekly_Sales', ax=ax2, alpha = 0.2)
plt.tight_layout() 
In [494]:
%matplotlib inline
final_df.plot(kind = "scatter", x = 'Promotion5', y = 'Weekly_Sales', alpha = 0.2,  figsize=(20,8))
plt.tight_layout() 

Let us have a look at sales fluctuations for the four holidays (Super Bowl, Labor Day, Thanksgiving and Christmas) using time-series analysis.
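The four holiday slices below all follow the same pattern, a closed date window around the holiday week, so a small helper keeps them consistent (the function name and toy data are my own, not from the notebook):

```python
import pandas as pd

def holiday_window(df, start, end, date_col="Date"):
    """Return the rows whose date falls in the closed interval [start, end]."""
    mask = (df[date_col] >= start) & (df[date_col] <= end)
    return df.loc[mask]

# Usage on a toy frame with three weekly dates.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-03-05"]),
    "Weekly_Sales": [24924.50, 46039.49, 21827.90],
})
window = holiday_window(toy, "2010-02-05", "2010-02-19")
print(len(window))  # 2
```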

In [631]:
mask = (final_df['Date'] >= '2010-02-05') & (final_df['Date'] <= '2010-02-19')
In [632]:
super_bowl_2010 = final_df.loc[mask]
In [633]:
mask_1 = (final_df['Date'] >= '2011-02-04') & (final_df['Date'] <= '2011-02-18')
In [634]:
super_bowl_2011 = final_df.loc[mask_1]
In [635]:
mask_2 = (final_df['Date'] >= '2012-02-03') & (final_df['Date'] <= '2012-02-17')
mask_3 = (final_df['Date'] >= '2013-02-01') & (final_df['Date'] <= '2013-02-15')
super_bowl_2012 = final_df.loc[mask_2]
super_bowl_2013 = final_df.loc[mask_3]
In [636]:
columns = ['Date', 'Weekly_Sales']
super_bowl_2010 = super_bowl_2010[columns]
super_bowl_2011 = super_bowl_2011[columns]
super_bowl_2012 = super_bowl_2012[columns]
super_bowl_2013 = super_bowl_2013[columns]
In [645]:
%matplotlib inline
# The training data ends on 2012-10-26, so the 2013 Super Bowl window is empty; we plot the 2010-2012 windows.
#super_bowl_2010['Weekly_Sales'].plot(linewidth=0.5)
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = super_bowl_2010)
plt.tight_layout()
plt.show()
In [647]:
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = super_bowl_2011)
plt.tight_layout()
plt.show()
In [648]:
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = super_bowl_2012)
plt.tight_layout()
plt.show()
In [650]:
labor_day_mask = (final_df['Date'] >= '2010-09-03') & (final_df['Date'] <= '2010-09-17')
thanksgiving_mask = (final_df['Date'] >= '2010-11-19') & (final_df['Date'] <= '2010-12-01')
christmas_mask = (final_df['Date'] >= '2010-12-24') & (final_df['Date'] <= '2011-01-07')
labor_2010 = final_df.loc[labor_day_mask]
thanksgiving_2010 = final_df.loc[thanksgiving_mask]
christmas_2010 = final_df.loc[christmas_mask]
In [651]:
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = labor_2010)
plt.tight_layout()
plt.show()
In [652]:
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = thanksgiving_2010)
plt.tight_layout()
plt.show()
In [653]:
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = christmas_2010)
plt.tight_layout()
plt.show()

The weekly sales in these windows cover only a couple of dates each (e.g. 2010-02-12 and 2010-09-10), so the plots are not very informative: they show little more than a single linear rise or fall between observations.

3) Predictive model - imputation

We can use an advanced predictive technique to impute the missing data, filling in the gaps so that we can later train more accurate regression/NN models on the dataset. The method used here is deep-learning imputation with the Datawig library.

  • Imputation using deep learning (Datawig), link: https://datawig.readthedocs.io/en/latest/index.html The variables most closely correlated with the Promotions and CPI are ['Weekly_Sales', 'Temperature', 'Fuel_Price']. I also calculated the R² score to see how close the imputed values come to the held-out training data.

As Datawig was not compatible with my Anaconda environment, I wrote the following Python script and executed it from the terminal:

# Libraries used
import datawig
import pandas as pd
from sklearn.metrics import r2_score as score

# Before imputing final_df.csv we clean the data by rounding to 1 decimal place and filling the
# null values, so that the r2_score metric does not throw an error about float size or null values.
df = pd.read_csv('final_df.csv')
df = df.round(1)
df = df.fillna(0)

# Split the dataset into training and test data, 80% to 20%.
df_train, df_test = datawig.utils.random_split(df, split_ratios=[0.8, 0.2])

# Create the SimpleImputer object. input_columns are the columns we believe are relevant for
# predicting the promotions; output_column is the column being imputed; output_path is the
# directory for the model and logs.
imputer = datawig.SimpleImputer(
    input_columns = ['Weekly_Sales', 'Temperature', 'Fuel_Price',
    'Promotion5', 'Promotion4', 'CPI', 'Promotion2', 'Promotion3'],
    output_column = 'Promotion1',
    output_path = 'imputer_model'
)

# Fit on the training data for the given number of epochs.
imputer.fit(train_df=df_train, num_epochs = 50)

# Produce the imputed dataframe for the test split.
imputed = imputer.predict(df_test)

# Calculate the R2_score between the original and the imputed Promotion1 on the test split.
r2 = score(imputed['Promotion1'], imputed['Promotion1_imputed'])
print('Promotion1 R2_score: ', r2)

A limitation of Datawig's SimpleImputer is that it imputes one column at a time, so I repeated the procedure above six times, once for each promotion and once for CPI.

The R2_score for all 6 imputations:

  • Promotion1 R2_score: 0.8796632334818133
  • Promotion2 R2_score: 0.6603054641256494
  • Promotion3 R2_score: 0.4918742528293464
  • Promotion4 R2_score: 0.8915634612104615
  • Promotion5 R2_score: 0.5287041047093568
  • CPI R2_score: 0.35543474289083
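For context on these numbers: R² measures the fraction of variance in the held-out values that the imputations explain (1.0 is perfect; 0.0 is no better than always predicting the mean). A minimal sanity check with synthetic data:

```python
import numpy as np
from sklearn.metrics import r2_score

true_vals = np.array([1.0, 2.0, 3.0, 4.0])
close_preds = np.array([1.1, 1.9, 3.2, 3.8])      # small errors -> R2 near 1
mean_preds = np.full(4, true_vals.mean())         # predicting the mean -> R2 of 0

good = r2_score(true_vals, close_preds)
baseline = r2_score(true_vals, mean_preds)
```

By this yardstick, Promotion1 and Promotion4 are imputed fairly well, while CPI (0.36) barely improves on a mean prediction.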
In [511]:
# Let us now read in our csv files with imputed columns.
Promotion1_imputed = pd.read_csv('Promotion1_imputed.csv')
Promotion2_imputed = pd.read_csv('Promotion2_imputed.csv')
Promotion3_imputed = pd.read_csv('Promotion3_imputed.csv')
Promotion4_imputed = pd.read_csv('Promotion4_imputed.csv')
Promotion5_imputed = pd.read_csv('Promotion5_imputed.csv')
CPI_imputed = pd.read_csv('CPI_imputed.csv')

We add all the imputed columns to a new final_df.

In [512]:
Promotion1_imputed['Promotion2_imputed'] = Promotion2_imputed['Promotion2_imputed']
In [513]:
Promotion1_imputed['Promotion3_imputed'] = Promotion3_imputed['Promotion3_imputed']
In [514]:
Promotion1_imputed['Promotion4_imputed'] = Promotion4_imputed['Promotion4_imputed']
In [515]:
Promotion1_imputed['Promotion5_imputed'] = Promotion5_imputed['Promotion5_imputed']
In [516]:
Promotion1_imputed['CPI_imputed'] = CPI_imputed['CPI_imputed']
In [517]:
final_df = Promotion1_imputed
In [518]:
final_df['Date'] =  pd.to_datetime(final_df['Date'],
                              format='%Y-%m-%d')

Exploratory statistics of the new final_df

In [519]:
final_df.describe()
Out[519]:
Unnamed: 0 Unnamed: 0.1 Store Dept Weekly_Sales Temperature Fuel_Price Promotion1 Promotion2 Promotion3 ... Promotion5 CPI Unemployment Size (sq ft) Promotion1_imputed Promotion2_imputed Promotion3_imputed Promotion4_imputed Promotion5_imputed CPI_imputed
count 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 ... 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000 84314.000000
mean 210695.293308 210695.293308 22.189316 44.185533 15935.248214 60.147537 3.363520 2564.187586 868.011879 480.725934 ... 1637.820726 171.133193 7.959815 136803.747586 2544.707231 788.006721 283.484859 1004.982256 1612.671382 167.410277
std 121565.178299 121565.178299 12.772008 30.502129 22847.804120 18.423663 0.458413 6012.202218 5113.540507 5621.333288 ... 4102.261148 39.159104 1.856633 61075.999005 5332.792499 3956.952610 3384.971164 3662.928095 2937.459257 23.474207
min 2.000000 2.000000 1.000000 1.000000 -4988.940000 -2.060000 2.472000 0.000000 -265.760000 -29.100000 ... 0.000000 126.064000 3.879000 34875.000000 -4625.325035 -23805.312916 -19080.561019 -3528.381807 -2887.803337 53.983860
25% 105807.250000 105807.250000 11.000000 18.000000 2105.597500 46.780000 2.935000 0.000000 0.000000 0.000000 ... 0.000000 132.022667 6.891000 93638.000000 29.789442 -514.410583 -487.320325 -158.023671 -61.477479 151.029090
50% 210538.500000 210538.500000 22.000000 37.000000 7593.705000 62.185000 3.461000 0.000000 0.000000 0.000000 ... 0.000000 182.256960 7.866000 140167.000000 304.071051 154.789129 -1.750390 0.437286 242.647013 164.955086
75% 315599.500000 315599.500000 33.000000 72.000000 20098.727500 74.290000 3.738000 2779.970000 1.750000 4.300000 ... 2156.750000 212.403576 8.567000 202505.000000 3460.881498 1047.966028 458.750220 742.598529 2892.723862 186.146118
max 421560.000000 421560.000000 45.000000 99.000000 649770.180000 100.140000 4.468000 88646.760000 104519.540000 141630.610000 ... 108519.280000 227.232807 14.313000 219622.000000 84277.038888 73354.229227 100960.170556 65524.221179 37100.669184 231.689593

8 rows × 21 columns

In [520]:
final_df = final_df.drop(axis = 1, columns = 'Unnamed: 0')
In [521]:
final_df = final_df.drop(axis = 1, columns = 'Unnamed: 0.1')
In [522]:
final_df.head()
Out[522]:
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price Promotion1 Promotion2 Promotion3 ... CPI Unemployment Type Size (sq ft) Promotion1_imputed Promotion2_imputed Promotion3_imputed Promotion4_imputed Promotion5_imputed CPI_imputed
0 25 24 2012-08-24 4038.41 False 64.53 3.834 7693.13 69.6 50.00 ... 214.956704 7.280 B 128107 8787.754929 806.574409 56.856950 1380.410153 5518.467382 170.801283
1 25 71 2012-07-13 3700.50 False 73.87 3.523 7262.73 45.6 18.01 ... 214.728027 7.280 B 128107 4120.744637 2283.356560 -1142.794812 2229.470544 5680.885331 190.443180
2 5 21 2011-01-14 1013.33 False 37.74 2.983 0.00 0.0 0.00 ... 212.008514 6.634 B 34875 -277.696923 -555.449053 -322.422134 548.093104 -76.214239 155.448629
3 12 13 2011-11-25 44972.99 True 53.25 3.622 5391.83 8.0 63143.29 ... 129.836400 12.890 B 112238 684.982010 -2577.924568 14896.869918 2000.297302 262.989837 127.964379
4 32 31 2011-05-06 1919.06 False 46.65 3.735 0.00 0.0 0.00 ... 192.725201 8.595 A 203007 -213.124242 72.496990 -2475.886981 -371.035826 343.612922 149.990982

5 rows × 22 columns

In [523]:
final_df['Date'].min()
Out[523]:
Timestamp('2010-02-05 00:00:00')
In [524]:
final_df['Date'].max()
Out[524]:
Timestamp('2012-10-26 00:00:00')

Now that we have imputed values for the promotions and CPI, we can use the additional data to train another model to predict Weekly_Sales.

Predictions of Weekly_Sales

  • Weekly_Sales R2_score: 0.5619225329783258
In [570]:
final_including_ws = pd.read_csv('Weekly_Sales_imputed.csv')
In [571]:
final_including_ws.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16862 entries, 0 to 16861
Data columns (total 25 columns):
Unnamed: 0              16862 non-null int64
Unnamed: 0.1            16862 non-null int64
Store                   16862 non-null int64
Dept                    16862 non-null int64
Date                    16862 non-null object
Weekly_Sales            16862 non-null float64
IsHoliday               16862 non-null bool
Temperature             16862 non-null float64
Fuel_Price              16862 non-null float64
Promotion1              16862 non-null float64
Promotion2              16862 non-null float64
Promotion3              16862 non-null float64
Promotion4              16862 non-null float64
Promotion5              16862 non-null float64
CPI                     16862 non-null float64
Unemployment            16862 non-null float64
Type                    16862 non-null object
Size (sq ft)            16862 non-null int64
Promotion1_imputed      16862 non-null float64
Promotion2_imputed      16862 non-null float64
Promotion3_imputed      16862 non-null float64
Promotion4_imputed      16862 non-null float64
Promotion5_imputed      16862 non-null float64
CPI_imputed             16862 non-null float64
Weekly_Sales_imputed    16862 non-null float64
dtypes: bool(1), float64(17), int64(5), object(2)
memory usage: 3.1+ MB
In [572]:
final_including_ws = final_including_ws.drop(axis = 1, columns = 'Unnamed: 0')
final_including_ws = final_including_ws.drop(axis = 1, columns = 'Unnamed: 0.1')
In [573]:
final_including_ws.head()
Out[573]:
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price Promotion1 Promotion2 Promotion3 ... Unemployment Type Size (sq ft) Promotion1_imputed Promotion2_imputed Promotion3_imputed Promotion4_imputed Promotion5_imputed CPI_imputed Weekly_Sales_imputed
0 34 67 2010-12-31 4555.33 True 34.11 2.955 0.00 0.0 0.00 ... 10.210 A 158114 -23.376474 -1607.985551 -353.826430 -301.255218 -80.106949 154.367388 5862.466556
1 43 97 2012-06-15 20011.29 False 87.75 3.393 402.48 0.0 0.65 ... 9.575 C 41062 2095.467932 1228.118108 688.213762 123.822650 554.509723 219.549703 19622.972464
2 10 29 2011-09-30 12505.63 False 82.27 3.877 0.00 0.0 0.00 ... 8.257 B 126512 278.288496 -1401.248725 317.925636 -157.032943 519.280497 139.631219 13679.830948
3 40 34 2010-04-02 9882.80 False 41.39 2.826 0.00 0.0 0.00 ... 5.435 A 155083 160.640149 -131.224040 -700.324773 36.995891 -18.198700 159.889929 12142.983804
4 26 92 2010-02-26 71127.20 False 27.32 2.753 0.00 0.0 0.00 ... 8.488 A 152513 249.207386 -3010.555324 26.838928 -10.333324 -2.116164 149.428467 59436.916811

5 rows × 23 columns

In [574]:
final_including_ws['Date'] =  pd.to_datetime(final_including_ws['Date'],
                              format='%Y-%m-%d')
In [575]:
final_including_ws['Date'].min()
Out[575]:
Timestamp('2010-02-05 00:00:00')
In [576]:
final_including_ws['Date'].max()
Out[576]:
Timestamp('2012-10-26 00:00:00')
In [577]:
final_including_ws = final_including_ws.sort_values(by = 'Date')
In [533]:
final_including_ws.hist(figsize=(20,20))
Out[533]:
<5 × 5 grid of matplotlib AxesSubplot objects — one histogram per numeric column>

4) Predictive model 2: regression analysis

In [757]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In [758]:
final_df_v2 = pd.read_csv("2013_sales_complete.csv")
In [759]:
final_df_v2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13106 entries, 0 to 13105
Data columns (total 26 columns):
Unnamed: 0                      13106 non-null int64
Unnamed: 0.1                    13106 non-null int64
CPI                             13106 non-null float64
CPI_imputed                     13106 non-null float64
Date                            13106 non-null object
Dept                            13106 non-null int64
Fuel_Price                      13106 non-null float64
IsHoliday                       13106 non-null bool
Promotion1                      13106 non-null float64
Promotion1_imputed              13106 non-null float64
Promotion2                      13106 non-null float64
Promotion2_imputed              13106 non-null float64
Promotion3                      13106 non-null float64
Promotion3_imputed              13106 non-null float64
Promotion4                      13106 non-null float64
Promotion4_imputed              13106 non-null float64
Promotion5                      13106 non-null float64
Promotion5_imputed              13106 non-null float64
Size (sq ft)                    13106 non-null int64
Store                           13106 non-null int64
Temperature                     13106 non-null float64
Type                            13106 non-null object
Unemployment                    13106 non-null float64
Weekly_Sales                    13106 non-null float64
Weekly_Sales to be predicted    13106 non-null int64
Weekly_Sales_imputed            13106 non-null float64
dtypes: bool(1), float64(17), int64(6), object(2)
memory usage: 2.5+ MB
In [760]:
final_df_v2 = final_df_v2.fillna(0)
In [761]:
final_df_v2 = final_df_v2.drop('Unnamed: 0', axis = 1)
final_df_v2 = final_df_v2.drop('Unnamed: 0.1', axis = 1)
In [762]:
final_df_v2['Date'] =  pd.to_datetime(final_df_v2['Date'],
                              format='%Y-%m-%d')
In [763]:
final_df_v2['Date'].min()
Out[763]:
Timestamp('2010-02-05 00:00:00')
In [764]:
final_df_v2['Date'].max()
Out[764]:
Timestamp('2013-07-26 00:00:00')
In [765]:
columns_1 = ['Store', 'Fuel_Price', 'Dept','Weekly_Sales',  'Temperature', 'Promotion1', 'Promotion2', 'Promotion3', 
          'Promotion4', 'Promotion5', 'Promotion1_imputed', 'Promotion2_imputed', 'Promotion3_imputed', 
          'Promotion4_imputed', 'Promotion5_imputed', 'Unemployment', 'Size (sq ft)', 'Weekly_Sales_imputed', 'Weekly_Sales to be predicted', 'CPI_imputed', 'CPI']
In [766]:
# Use min max scaler
scaler = MinMaxScaler()
DF_Scaled = scaler.fit_transform(final_df_v2[columns_1])
DF_Scaled = pd.DataFrame(data=DF_Scaled, columns=final_df_v2[columns_1].columns)
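MinMaxScaler rescales each column independently to the [0, 1] range via (x - min) / (max - min). A quick check of that behaviour on a tiny array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 300.0]])
scaled = MinMaxScaler().fit_transform(data)
# Each column now spans exactly [0, 1]; e.g. 20 -> (20 - 10) / (30 - 10) = 0.5.
```

This is why columns with large outliers (such as Weekly_Sales, max ≈ 650k) end up with most scaled values very close to 0.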
In [767]:
DF_Scaled = DF_Scaled.fillna(0)
In [768]:
DF_Scaled.head()
Out[768]:
Store Fuel_Price Dept Weekly_Sales Temperature Promotion1 Promotion2 Promotion3 Promotion4 Promotion5 ... Promotion2_imputed Promotion3_imputed Promotion4_imputed Promotion5_imputed Unemployment Size (sq ft) Weekly_Sales_imputed Weekly_Sales to be predicted CPI_imputed CPI
0 0.272727 0.0 0.336735 0.000473 0.02034 0.0 0.001834 0.000007 0.0 0.0 ... 0.134081 0.196865 0.028848 0.08866 0.0 0.0 0.101407 0.0 0.0 0.0
1 0.272727 0.0 0.020408 0.000473 0.02034 0.0 0.001834 0.000007 0.0 0.0 ... 0.134081 0.196865 0.028848 0.08866 0.0 0.0 0.105970 0.0 0.0 0.0
2 0.068182 0.0 0.030612 0.000473 0.02034 0.0 0.001834 0.000007 0.0 0.0 ... 0.134081 0.196865 0.028848 0.08866 0.0 0.0 0.106690 0.0 0.0 0.0
3 0.227273 0.0 0.010204 0.000473 0.02034 0.0 0.001834 0.000007 0.0 0.0 ... 0.134081 0.196865 0.028848 0.08866 0.0 0.0 0.106773 0.0 0.0 0.0
4 0.295455 0.0 0.153061 0.000473 0.02034 0.0 0.001834 0.000007 0.0 0.0 ... 0.134081 0.196865 0.028848 0.08866 0.0 0.0 0.103761 0.0 0.0 0.0

5 rows × 21 columns

In [769]:
# Make X and y
y = DF_Scaled['Weekly_Sales_imputed']
DF_Scaled = DF_Scaled.drop('Weekly_Sales_imputed', axis = 1)
X = DF_Scaled
In [770]:
# Add an explicit intercept column (X references DF_Scaled, so this column appears in X as well)
DF_Scaled['_intercept'] = 1
In [771]:
# Split the dataset into a training set and a test set (default 75%/25% split)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The data was already filled with fillna(0) and scaled, so no further imputation is needed.
# Note: sm here is statsmodels.formula.api; recent statsmodels versions expose OLS only via
# statsmodels.api.
ols_model = sm.OLS(np.array(y_train), np.array(X_train))

# Fit the model
result = ols_model.fit()
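Because statsmodels' OLS does not add a constant automatically, the explicit intercept column matters. The same least-squares fit with an intercept column can be reproduced with numpy alone; a sketch on noiseless toy data from y = 2x + 1 (names are illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                      # noiseless line: slope 2, intercept 1

# Design matrix with an explicit intercept column, mirroring the _intercept column above.
X_design = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
slope, intercept = coef
```

Without the column of ones, the fit would be forced through the origin and the recovered slope would be biased.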
In [1]:
from IPython.display import Image
Image("OLS_results.png")
Out[1]:
In [773]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)  
regressor.fit(X_train, y_train)  
y_pred = regressor.predict(X_test)  
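As a sanity check of the fit/predict API used here, a RandomForestRegressor with the same hyperparameters recovers a trivially learnable signal; a small sketch with synthetic data (names are my own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = X_toy.ravel() * 3.0            # deterministic target: y = 3x

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)
pred = forest.predict([[5.0]])[0]      # should land close to 15
```

Each tree averages the training targets in the leaf containing x = 5, so with dense uniform samples the ensemble prediction sits very near the true value.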
In [774]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred))) 
Mean Absolute Error: 0.00554390305146
Mean Squared Error: 0.000322561653259
Root Mean Squared Error: 0.0179600014827
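These three metrics are simple functions of the residuals, and RMSE is just the square root of MSE; computing them by hand on a small example makes the relationship explicit:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.5])

residuals = y_true - y_hat
mae = np.abs(residuals).mean()          # mean absolute error
mse = (residuals ** 2).mean()           # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error
```

Note that because the target here is min-max scaled, the errors above are on the [0, 1] scale, not in dollars of weekly sales.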
In [727]:
%matplotlib inline
"""
We will now use pairplot from the Seaborn library which allows us to draw statistical plots based 
on Matplotlib. Once we visualise the data it would be easier to detect outliers.
"""

# This is the subset of columns I will be focusing on, these can be changed to include other columns as well.
columns_2 = ['Fuel_Price', 'Dept','Weekly_Sales',  'Temperature', 'Promotion1', 'Promotion2', 'Promotion3', 
          'Promotion4', 'Promotion5', 'Promotion1_imputed', 'Promotion2_imputed', 'Promotion3_imputed', 
          'Promotion4_imputed', 'Promotion5_imputed', 'Size (sq ft)', 'Weekly_Sales to be predicted', 'CPI_imputed']

flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
sns.set_palette(flatui)
sns.pairplot(final_df_v2[columns_2], height = 5.0)
plt.tight_layout()
plt.show()
In [781]:
columns_3 = ['Date','Fuel_Price', 'Dept','Weekly_Sales',  'Temperature', 'Promotion1', 'Promotion2', 'Promotion3', 
          'Promotion4', 'Promotion5', 'Promotion1_imputed', 'Promotion2_imputed', 'Promotion3_imputed', 
          'Promotion4_imputed', 'Promotion5_imputed', 'Size (sq ft)', 'Weekly_Sales to be predicted', 'Weekly_Sales_imputed', 'CPI_imputed']
In [780]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Date', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Dept', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()
In [782]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Temperature', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Size (sq ft)', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()
In [732]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Date', y ='Promotion1_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion2_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout() 
plt.show()
In [733]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Date', y ='Promotion3_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion4_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout() 
plt.show()
In [734]:
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Store', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Dept', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout() 
plt.show()